What is a Bloom Filter?

A Bloom Filter is a probabilistic data structure used to test whether an element is a member of a set. It is designed to be very fast and extremely space-efficient, especially when working with large volumes of data where memory is constrained.

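[Illustration: a 32-bit Bloom Filter after adding three items (apple, banana, cherry). 9 of 32 bits are set, for an estimated false positive rate of 2.2%.]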

However, it comes with one trade-off:

It may produce false positives (saying an element is in the set when it actually isn't).
But it never produces false negatives (if it says an element is not present, it's guaranteed to be absent).

In this chapter, we will explore the low-level design of a Bloom Filter in detail.

Let's start by clarifying the requirements:

1. Clarifying Requirements

Before starting the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.

Here is an example of how a discussion between the candidate and the interviewer might unfold:

Discussion

Candidate: Should the Bloom Filter support generic types (e.g., any object), or should we limit it to specific data types like strings or integers?

Interviewer: Let's assume for this design that we're working with strings.

Candidate: What operations should the Bloom Filter support? Just add and mightContain, or do we also need to support deletion?

Interviewer: Just add and mightContain. Deletion is not required in a standard Bloom Filter.

Candidate: What kind of false result is acceptable? Are we okay with false positives?

Interviewer: Yes, that's expected. A Bloom Filter can return false positives, but it should never return false negatives.

Candidate: Should the number of hash functions (k) and size of the bit array (m) be configurable or fixed?

Interviewer: Ideally, they should be configurable at initialization. The values may vary depending on the expected number of elements and acceptable false positive rate.

Candidate: What should be the behavior if the same element is added multiple times?

Interviewer: That's fine. Adding to a Bloom Filter is idempotent: adding the same element again only sets bits that are already 1, so it has no effect on the filter's state or correctness.

Candidate: Should the design support concurrency? For example, multiple threads calling add and mightContain?

Interviewer: Yes. Assume the Bloom Filter will be accessed from a multi-threaded environment.

Based on the discussion, here's a summary of the functional requirements:

1.1 Functional Requirements


Support add(element): Add an element to the set.
Support mightContain(element): Check if an element might be in the set.
The filter should operate on strings as the input data type.
Support configurable values of k (number of hash functions) and m (size of the bit array).

1.2 Non-Functional Requirements

Space Efficiency: The Bloom Filter should use significantly less memory than storing the actual elements.
High Performance: Both add and mightContain should be optimized for speed. Each call performs k hash computations and k bit accesses, which is effectively O(1) time because k is a small constant fixed at initialization.
Thread-Safety: The implementation should be safe to use in a concurrent environment.

2. Identifying Core Entities

The design of a Bloom Filter is fundamentally different from typical object-oriented systems that model real-world entities. Instead of focusing on domain objects, we focus on choosing the right low-level data structures and abstractions that enable fast, memory-efficient, and thread-safe operations.

Our goal is to support two operations, add(element) and mightContain(element), both in O(1) time and using minimal memory, while remaining safe for use in multi-threaded environments.

To meet these requirements, we must combine three key components:

A bit array to track element presence
A set of hash functions to map inputs to bit positions
A main coordinating class that exposes the public API and manages the internal logic

Let's break down these components and understand how they interact.

1. Bit Array (Data Store)

The first and most fundamental requirement is space efficiency. Since we are explicitly told not to store the actual elements, we need a compact representation of which elements have been seen.

This leads us to a bit array: a one-dimensional array where each element is a single bit (0 or 1).

Initially, all bits are set to 0.
When an element is added, multiple hash functions map it to k bit positions, and each of those bits is set to 1.
When checking for presence, the same k bits are examined. If all are 1, the element might be in the set. If any bit is 0, it is definitely not.
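The three rules above can be sketched with a plain `bytearray` that packs eight bits per byte. This is only an illustration of the idea, not the article's implementation: the class name `SimpleBitArray` and the bit positions used below are hypothetical.

```python
class SimpleBitArray:
    """A fixed-size array of bits packed eight per byte (illustrative only)."""

    def __init__(self, size: int):
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        # All bits start at 0; round up to a whole number of bytes.
        self._bytes = bytearray((size + 7) // 8)

    def set(self, index: int) -> None:
        """Set the bit at `index` to 1."""
        self._check(index)
        self._bytes[index >> 3] |= 1 << (index & 7)

    def get(self, index: int) -> bool:
        """Return True if the bit at `index` is 1."""
        self._check(index)
        return bool(self._bytes[index >> 3] & (1 << (index & 7)))

    def _check(self, index: int) -> None:
        if not 0 <= index < self.size:
            raise IndexError("bit index out of range")


bits = SimpleBitArray(32)
for position in (3, 17, 29):  # pretend k = 3 hash functions chose these positions
    bits.set(position)

# All three bits are 1, so the element might be present.
all_set = all(bits.get(p) for p in (3, 17, 29))
# Bit 30 is still 0, so an element mapping to (3, 17, 30) is definitely absent.
one_clear = all(bits.get(p) for p in (3, 17, 30))
```

A 32-bit filter occupies only four bytes here, which is the kind of saving the space-efficiency requirement is after.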

In Java, we would typically use java.util.BitSet for this purpose, as it provides a compact, memory-efficient implementation and utilities for bit manipulation. The Python implementation in Section 4 uses a bitarray object in the same role.

2. Hash Functions

To populate and check the bit array, we need a way to map each element to one or more bit positions. This is the role of our hash functions.

A standard Bloom Filter uses k independent hash functions, each of which maps a string input to a bit array index between 0 and m - 1.

The same set of hash functions must be used in both add() and mightContain() operations.
Multiple hash functions reduce the chance of collisions and help maintain a low false positive rate.

In our Bloom Filter, we will maintain an array or list of k HashFunction implementations, each producing a different hash value for the same input.
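Keeping a list of k separate implementations is one option. A common alternative, shown here as a sketch rather than as what this design does, is double hashing: derive all k positions from two base hashes as (h1 + i * h2) mod m. The function name `bit_positions` and the use of SHA-256 as the source of the two base hashes are assumptions made for this example.

```python
import hashlib


def bit_positions(item: str, k: int, m: int) -> list[int]:
    """Derive k indices in [0, m) from two base hashes: (h1 + i * h2) mod m."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    # Force h2 to be odd so it is never zero (a zero step would repeat h1 k times).
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]


positions = bit_positions("apple", k=3, m=32)
```

Because the positions depend only on the input, k, and m, calling the function again during a lookup yields the same indices that were set during insertion.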

3. BloomFilter

Finally, we need a public-facing class that ties everything together and exposes a clean API to the user. This is the BloomFilter class.

It acts as the coordinator or facade, hiding the internal complexity of hashing and bit manipulation while providing two easy-to-use methods:

add(String element)
mightContain(String element): boolean

Summary of Core Entities

Bit Array: A fixed-size array of m bits representing the filter's state. Supports fast setting and checking of bits.
Hash Functions: A group of k independent hash functions used to map string elements to indices in the bit array.
BloomFilter: The main class that exposes add() and mightContain() and manages internal data structures.

Together, these core components allow us to build a Bloom Filter that is compact and fast while delivering the required functionality and performance.

3. Designing Classes and Relationships

Now that we've identified the core components of a Bloom Filter, it's time to convert those ideas into a structured class design. This includes defining the responsibilities and attributes of each class, establishing relationships between them, applying design patterns for clean extensibility, and finally visualizing the system in a class diagram.

3.1 Class Definitions

We begin by defining the key classes involved in the system, starting with data structures and utilities, and culminating in the main interface.

Enums

`BloomFilter`

This is the main class that exposes the public API (add and mightContain) and orchestrates the behavior of the internal components.

Attributes:

bitArray: BitSet – The core data store that tracks presence via bits.
hashFunctions: List<HashFunction> – A list of k independent hash functions.
size: int – The total size m of the bit array.

Methods:

BloomFilter(int size, List<HashFunction> hashFunctions) – Constructor to initialize size and hash functions.
void add(String element) – Applies all hash functions to the input and sets the corresponding bits.
boolean mightContain(String element) – Applies all hash functions and checks if all relevant bits are set.

`HashFunction` (Interface)

Defines the contract for hash functions used by the Bloom Filter.

Method:

int hash(String input, int seed, int bound) – Computes a hash of the input string, using a seed, and returns a result in the range [0, bound).

This allows support for multiple independent hash functions by using different seeds.
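A minimal Python sketch of this contract might look as follows. The `SeededFnv1a` class is hypothetical, and mixing the seed into the FNV-1a starting value is just one simple way to vary the output per seed; it is not a rigorous way to obtain independent hash functions. The first parameter is named `text` here to avoid shadowing Python's built-in `input`.

```python
from abc import ABC, abstractmethod


class HashFunction(ABC):
    @abstractmethod
    def hash(self, text: str, seed: int, bound: int) -> int:
        """Return an index in [0, bound) for `text` under the given seed."""


class SeededFnv1a(HashFunction):
    """Hypothetical implementation: 64-bit FNV-1a with the seed mixed into the start value."""

    _PRIME = 0x100000001B3
    _OFFSET_BASIS = 0xCBF29CE484222325
    _MASK = 0xFFFFFFFFFFFFFFFF  # keep every intermediate value within 64 bits

    def hash(self, text: str, seed: int, bound: int) -> int:
        if bound <= 0:
            raise ValueError("bound must be positive")
        value = (self._OFFSET_BASIS ^ seed) & self._MASK
        for byte in text.encode("utf-8"):
            value ^= byte
            value = (value * self._PRIME) & self._MASK
        return value % bound


fn = SeededFnv1a()
# One implementation, three seeds: three bit positions for the same input.
indices = [fn.hash("apple", seed, 32) for seed in range(3)]
```

With this shape, the filter only needs one `HashFunction` object plus the seeds 0 to k - 1 to obtain k positions per element.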

3.2 Class Relationships

Understanding how these classes interact clarifies system behavior and responsibilities.

Composition

BloomFilter has-a BitSet
BloomFilter has-a List<HashFunction>

The BloomFilter owns and manages these components. Their lifecycle is tied to the lifecycle of the filter.

Association

BloomFilter uses HashFunction to compute bit positions.
HashFunction is an abstraction that allows the filter to be configured with different implementations.

3.3 Key Design Patterns

We apply a few classic design principles and patterns to keep the system modular, flexible, and testable.

Strategy Pattern

The HashFunction interface and its implementations follow the Strategy Pattern. This allows us to inject different hashing strategies into the filter at runtime, based on the use case or performance characteristics.

Example: We can replace DefaultHashFunction with MurmurHashFunction or Djb2Hash without changing the BloomFilter class.

Factory Pattern

Creating a Bloom Filter requires complex calculations to determine the optimal bit array size (m) and number of hash functions (k) based on user-friendly inputs (expected elements and desired false positive rate). We provide a public static factory method BloomFilter.create(...) that encapsulates this complexity, offering a much cleaner API to the client than a constructor that requires m and k directly.
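The calculation such a factory method would hide can be sketched with the standard sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, where n is the expected number of elements and p is the target false positive rate. The names `BloomParameters` and `optimal_parameters` are hypothetical; the article does not show the body of BloomFilter.create(...).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BloomParameters:
    bit_array_size: int      # m
    num_hash_functions: int  # k


def optimal_parameters(expected_elements: int, false_positive_rate: float) -> BloomParameters:
    """Apply m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2, rounding to integers."""
    if expected_elements <= 0:
        raise ValueError("expected_elements must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be strictly between 0 and 1")
    m = math.ceil(-expected_elements * math.log(false_positive_rate) / (math.log(2) ** 2))
    k = max(1, round((m / expected_elements) * math.log(2)))
    return BloomParameters(m, k)


params = optimal_parameters(expected_elements=1000, false_positive_rate=0.01)
```

For 1,000 expected elements and a 1% target rate, this works out to roughly 9,600 bits and 7 hash functions, or a little under 10 bits per element regardless of how long the strings are.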

Builder Pattern

Facade Pattern (Implicit)

The BloomFilter class acts as a facade, exposing a simplified interface (add and mightContain) while hiding all internal details related to hashing, bit manipulation, and concurrency handling.

This simplifies the usage for clients and makes the internal system easier to evolve without breaking external code.
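One simple way to handle concurrency behind this facade is a single lock around all bit reads and writes. The sketch below is an assumption about how that could look, not the article's code: `ThreadSafeBloomFilter`, `salted_hash`, and the use of salted SHA-256 digests as hash functions are all made up for the example.

```python
import hashlib
import threading


def salted_hash(salt: str):
    """Build a str -> int hash by prefixing a salt (an assumption for this sketch)."""
    def hash_fn(element: str) -> int:
        digest = hashlib.sha256((salt + element).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return hash_fn


class ThreadSafeBloomFilter:
    """Illustrative filter whose bit array is guarded by one lock."""

    def __init__(self, size: int, hash_functions):
        self._size = size
        self._hash_functions = list(hash_functions)
        self._bits = bytearray((size + 7) // 8)
        self._lock = threading.Lock()

    def _positions(self, element: str) -> list[int]:
        return [fn(element) % self._size for fn in self._hash_functions]

    def add(self, element: str) -> None:
        positions = self._positions(element)  # hashing touches no shared state
        with self._lock:
            for p in positions:
                # Read-modify-write on a byte is not atomic, hence the lock.
                self._bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, element: str) -> bool:
        positions = self._positions(element)
        with self._lock:
            return all(self._bits[p >> 3] & (1 << (p & 7)) for p in positions)


bloom = ThreadSafeBloomFilter(10_000, [salted_hash("a"), salted_hash("b")])


def worker(start: int) -> None:
    for i in range(start, start + 250):
        bloom.add(f"user{i}@example.com")


threads = [threading.Thread(target=worker, args=(s,)) for s in (0, 250, 500, 750)]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_found = all(bloom.might_contain(f"user{i}@example.com") for i in range(1000))
```

A single lock is the easiest option to reason about. Finer-grained schemes, such as one lock per segment of the bit array, trade that simplicity for less contention.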

3.4 Full Class Diagram

4. Implementation

4.1 `HashType` Enum

    from enum import Enum


    class HashType(Enum):
        FNV1A = "FNV1A"
        DJB2 = "DJB2"

Defines the set of supported hash function types. This allows runtime flexibility in choosing hash algorithms and enables extensibility by adding new types later.

4.2 `HashStrategy` Interface and Implementations

Defines a contract for all hashing algorithms used in the Bloom Filter. Any strategy must implement a consistent way to hash strings into long integers.

    from abc import ABC, abstractmethod

    # Python integers are unbounded, so results are masked to 64 bits explicitly.
    MASK_64 = 0xFFFFFFFFFFFFFFFF


    class HashStrategy(ABC):
        @abstractmethod
        def hash(self, data: str) -> int:
            pass


    class FNV1aHashStrategy(HashStrategy):
        # FNV-1a 64-bit constants
        FNV_PRIME = 0x100000001b3
        FNV_OFFSET_BASIS = 0xcbf29ce484222325

        def hash(self, data: str) -> int:
            hash_value = self.FNV_OFFSET_BASIS
            for byte in data.encode('utf-8'):
                hash_value ^= byte
                # Truncate after each multiply to emulate 64-bit overflow.
                hash_value = (hash_value * self.FNV_PRIME) & MASK_64
            return hash_value


    class DJB2HashStrategy(HashStrategy):
        def hash(self, data: str) -> int:
            hash_value = 5381
            for byte in data.encode('utf-8'):
                # hash = hash * 33 + c, kept within 64 bits
                hash_value = (((hash_value << 5) + hash_value) + byte) & MASK_64
            return hash_value

FNV1aHashStrategy: Implements the FNV-1a (64-bit) hashing algorithm. It's fast, simple, and has good dispersion for small strings.
DJB2HashStrategy: Implements the DJB2 hash function, known for simplicity and relatively low collision rate.

4.3 `HashStrategyFactory`

    class HashStrategyFactory:
        @staticmethod
        def create(hash_type: HashType) -> HashStrategy:
            if hash_type == HashType.FNV1A:
                return FNV1aHashStrategy()
            elif hash_type == HashType.DJB2:
                return DJB2HashStrategy()
            else:
                raise ValueError(f"Unsupported hash type: {hash_type}")

Implements the Factory Pattern to provide the appropriate hashing strategy based on the enum value. Decouples strategy instantiation from client code.

4.4 `BloomFilter` Class

This is the core Bloom Filter class.

    from bitarray import bitarray  # third-party package providing a packed bit array


    class BloomFilter:
        def __init__(self, bit_set_size: int, num_hash_functions: int, strategies: list[HashStrategy]):
            self.bit_set_size = bit_set_size
            self.num_hash_functions = num_hash_functions
            self.bit_set = bitarray(bit_set_size)
            self.bit_set.setall(0)  # start with every bit cleared
            self.hash_strategies = strategies

        def add(self, item: str):
            # Set one bit per hash strategy.
            for i in range(self.num_hash_functions):
                hash_value = self.hash_strategies[i].hash(item)
                index = abs(hash_value) % self.bit_set_size
                self.bit_set[index] = 1

        def might_contain(self, item: str) -> bool:
            for i in range(self.num_hash_functions):
                hash_value = self.hash_strategies[i].hash(item)
                index = abs(hash_value) % self.bit_set_size
                if not self.bit_set[index]:
                    return False  # Definitely not in the set
            return True  # Might be in the set

        class Builder:
            def __init__(self):
                self.bit_set_size = 0
                self.num_hash_functions = 0
                self.strategies = None

            def with_bit_set_size(self, bit_set_size: int):
                if bit_set_size <= 0:
                    raise ValueError("Bit set size must be positive.")
                self.bit_set_size = bit_set_size
                return self

            def with_num_hash_functions(self, num_hash_functions: int):
                if num_hash_functions <= 0:
                    raise ValueError("Number of hash functions must be positive.")
                self.num_hash_functions = num_hash_functions
                return self

            def with_hash_strategies(self, strategies: list[HashStrategy]):
                if strategies is None or len(strategies) == 0:
                    raise ValueError("At least one hash strategy must be provided.")
                self.strategies = strategies
                return self

            def build(self) -> 'BloomFilter':
                if self.bit_set_size == 0 or self.num_hash_functions == 0 or self.strategies is None:
                    raise ValueError("Must set bit set size, number of hash functions, and strategies.")
                if len(self.strategies) < self.num_hash_functions:
                    raise ValueError(
                        f"The number of provided hash strategies ({len(self.strategies)}) "
                        f"must be at least equal to the number of hash functions required ({self.num_hash_functions})."
                    )
                print("Creating Bloom Filter with specified parameters:")
                print(f"  - Bit set size (m): {self.bit_set_size}")
                print(f"  - Hash functions (k): {self.num_hash_functions}")
                return BloomFilter(self.bit_set_size, self.num_hash_functions, self.strategies)

bit_set: The binary array (bitmap) representing membership.
num_hash_functions: Controls how many different hashes are computed for each item.
hash_strategies: A list of HashStrategy instances used to simulate multiple independent hash functions.

`add(String item)`

Inserts an element into the Bloom filter by setting k bits in the bit array. Each hash function produces a position; that bit is set to 1.

`mightContain(String item)`

Checks for probable presence by verifying all bits that would have been set for the element. If any bit is not set, the element is definitely not present (no false negatives). If all bits are set, the item might be present (possible false positives).
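A toy example makes both outcomes concrete. The bit positions below are hand-picked for illustration (a real filter would compute them with hash functions), and the filter has only m = 8 bits and k = 2 positions per item.

```python
# Hypothetical bit positions for a toy filter with m = 8 and k = 2.
POSITIONS = {
    "apple": (1, 3),
    "banana": (3, 5),
    "cherry": (1, 5),  # never added, but both bits get set by the other items
    "durian": (2, 6),  # never added, and bit 2 stays 0
}

bits = [0] * 8


def add(item: str) -> None:
    for index in POSITIONS[item]:
        bits[index] = 1


def might_contain(item: str) -> bool:
    return all(bits[index] for index in POSITIONS[item])


add("apple")
add("banana")

results = {item: might_contain(item) for item in POSITIONS}
# results["cherry"] is True even though cherry was never added: a false positive.
# results["durian"] is False: a guaranteed "not present".
```

Every added item reports True because its own bits can only ever go from 0 to 1, which is why false negatives cannot occur.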

Builder Pattern

Constructing a BloomFilter requires several parameters that must be set correctly. The Builder pattern provides a fluent, readable API for this configuration. It also allows us to perform validation within the build() method, ensuring that no BloomFilter object is ever created in an invalid state.

4.6 Example Usage: `BloomFilterDemo`

The BloomFilterDemo class demonstrates how a client would use the system, highlighting its core properties: the guarantee of no false negatives and the probability of false positives.

    import sys
    import uuid


    class BloomFilterDemo:
        @staticmethod
        def main():
            # --- 1. Manually define parameters ---
            bit_set_size = 10000
            num_hash_functions = 2
            expected_insertions = 1000

            # --- 2. Create a list of hash strategies at runtime ---
            # We use the Factory to obtain one strategy per hash function.
            strategies = [
                HashStrategyFactory.create(HashType.FNV1A),
                HashStrategyFactory.create(HashType.DJB2)
            ]

            # --- 3. Build the filter using the Builder ---
            # Named bloom_filter to avoid shadowing the built-in filter().
            bloom_filter = BloomFilter.Builder() \
                .with_bit_set_size(bit_set_size) \
                .with_num_hash_functions(num_hash_functions) \
                .with_hash_strategies(strategies) \
                .build()

            # --- 4. Add elements to the filter ---
            print("\n--- Adding elements to the filter ---")
            inserted_elements = []
            for i in range(expected_insertions):
                element = f"user{i}@example.com"
                inserted_elements.append(element)
                bloom_filter.add(element)
            print(f"{expected_insertions} elements have been added.")

            # --- 5. Test for presence (no false negatives) ---
            print("\n--- Verifying no false negatives ---")
            has_false_negatives = False
            for element in inserted_elements:
                if not bloom_filter.might_contain(element):
                    print(f"FALSE NEGATIVE DETECTED FOR: {element}", file=sys.stderr)
                    has_false_negatives = True
                    break
            if not has_false_negatives:
                print("Success! No false negatives found. All inserted elements were detected.")

            # --- 6. Test for false positives ---
            print("\n--- Testing for false positives ---")
            test_set_size = 10000
            false_positives_count = 0
            for i in range(test_set_size):
                random_element = str(uuid.uuid4())
                if bloom_filter.might_contain(random_element):
                    false_positives_count += 1
            print(f"Number of false positives found: {false_positives_count} out of {test_set_size} random items.")


    if __name__ == "__main__":
        BloomFilterDemo.main()

This driver code demonstrates a complete workflow:

Configuration: The filter is configured and built using the fluent Builder API.
Population: A set of known elements is added to the filter.
Verification: The demo programmatically verifies the Bloom Filter's primary guarantee: that it will never return false for an element that has been added (no false negatives).
Demonstration of Trade-off: It then tests a large number of random elements that were never added to show that a small percentage are incorrectly identified as being present (false positives). This clearly illustrates the space-vs-accuracy trade-off inherent in a Bloom Filter.
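To get a feel for how many false positives the demo should report, the standard approximation (1 - e^(-k*n/m))^k can be evaluated for its parameters (m = 10,000 bits, k = 2 hash functions, n = 1,000 insertions). The helper name below is hypothetical, and the formula assumes ideal, independent, uniformly distributed hash functions, so the count observed with FNV-1a and DJB2 may differ.

```python
import math


def expected_false_positive_rate(m: int, k: int, n: int) -> float:
    """Approximate false positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


rate = expected_false_positive_rate(m=10_000, k=2, n=1_000)
expected_hits = rate * 10_000  # the demo probes 10,000 random strings
```

Under those assumptions the rate is about 3.3%, so a few hundred of the 10,000 random probes would be expected to come back as "might contain".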





Design Bloom Filter

Ashish Pratap Singh
